Most AI demos are lies — not maliciously, just by selection. You see the run that worked, not the nine that didn't. A cherry-picked demo is indistinguishable from a robust system right up until it's in production failing on inputs the demo never tried. Evals are the only way to tell the difference, and the difference is the whole job.
The mistake most teams make is treating evals as a dashboard they glance at once. The version that actually protects you treats them like regression tests: a fixed set, scored automatically, with a threshold that fails the build when quality drops. If you can't put a number on "did this change make it worse," you're running on vibes and calling it judgment.
1. Start With a Fixed Question Set
You can't measure drift without a baseline. The minimum viable eval is a curated set of 50–100 representative inputs with known-good expectations. Every meaningful change (a new model, a prompt edit, a different chunking strategy, a routing change) runs against that same set so the comparison is apples to apples. The set is an asset: every time a real failure slips through, add the case that broke, so the same failure can't slip through twice.
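A minimal sketch of what a fixed set could look like on disk, assuming one JSON object per line. The `EvalCase` fields and the helper names are hypothetical, chosen for illustration rather than taken from any particular repo.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalCase:
    """One entry in the fixed question set (field names are illustrative)."""
    case_id: str
    question: str
    expected: str            # the known-good expectation for this input
    source: str = "curated"  # "curated", or "regression" for a case that broke in the wild


def add_regression_case(cases, case):
    """Return a new list with `case` appended, rejecting duplicate ids."""
    if any(c.case_id == case.case_id for c in cases):
        raise ValueError(f"duplicate case_id: {case.case_id}")
    return cases + [case]


def dump_cases(cases):
    """Serialize the set as JSON Lines so diffs in version control stay readable."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


def load_cases(text):
    """Parse JSON Lines back into EvalCase objects, skipping blank lines."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

Keeping the set in a versioned text file means every added regression case shows up in review like any other change.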
2. Score With Metrics, Not Eyeballs
For retrieval and grounded answers, three metrics catch most failures. I run these via RAGAS in my RAG Knowledge Engine:
- Faithfulness: is every claim in the answer grounded in the retrieved context? Catches hallucination.
- Answer relevance: does the answer actually address the question? Catches confident off-topic responses.
- Context recall: did the retriever surface the chunks needed to answer? Catches retrieval failures upstream of the model.
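To make the third metric concrete, here is a simplified, ID-based proxy for context recall. This is not the RAGAS implementation; it assumes each case in the fixed set is annotated with the chunk ids needed to answer it, and the function and parameter names are hypothetical.

```python
def context_recall(needed_ids, retrieved_ids):
    """Fraction of the needed chunks that the retriever actually surfaced.

    Simplified proxy: compares chunk ids only, and assumes the eval set
    labels which chunks are required for each question.
    """
    needed = set(needed_ids)
    if not needed:
        return 1.0  # nothing was required, so nothing could be missed
    return len(needed & set(retrieved_ids)) / len(needed)


def mean_context_recall(cases):
    """Average recall over (needed_ids, retrieved_ids) pairs from the fixed set."""
    if not cases:
        raise ValueError("empty eval set")
    return sum(context_recall(n, r) for n, r in cases) / len(cases)
```

A low score here points at the retriever rather than the model: if the needed chunks never arrived, no prompt change will fix the answer.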
For open-ended agent output where there's no exact string to match, an LLM-as-judge scores the response against a rubric. It's not perfect, but a consistent judge applied to a fixed set reliably surfaces relative regressions — which is what you care about between versions.
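A rough sketch of the judge loop, assuming the judge is any callable that takes a prompt string and returns the model's text reply. The rubric wording, the 1 to 5 scale, and every name below are assumptions for illustration; swap in whatever client and rubric you actually use.

```python
import re
from statistics import mean

# Hypothetical rubric; the point is that it stays fixed between versions.
RUBRIC = (
    "Score the response from 1 to 5.\n"
    "5: fully answers the question, no unsupported claims.\n"
    "3: partially answers, or includes minor unsupported claims.\n"
    "1: off-topic or mostly unsupported."
)


def build_judge_prompt(question, response, rubric=RUBRIC):
    """Assemble the same prompt shape for every case so scores are comparable."""
    return f"{rubric}\n\nQuestion:\n{question}\n\nResponse:\n{response}\n\nScore (1-5):"


def parse_score(raw):
    """Pull the first standalone integer 1-5 out of the judge's reply."""
    match = re.search(r"\b([1-5])\b", raw)
    if match is None:
        raise ValueError(f"no score found in judge reply: {raw!r}")
    return int(match.group(1))


def judge_set(pairs, judge):
    """Mean rubric score over (question, response) pairs.

    `judge` is an injected callable (prompt -> reply text), so the same
    code runs against a real model or a stub in tests.
    """
    return mean(parse_score(judge(build_judge_prompt(q, r))) for q, r in pairs)
```

Run `judge_set` on the old and new versions' outputs for the same questions and compare the two means; the absolute number matters less than the direction it moves.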
3. Test the Pieces, Not Just the Whole
A multi-agent system fails at the seams. Evaluate each agent's contract in isolation before you judge the pipeline. The agents in my agentic-systems repo show the unit-level version: the test-case generator is checked on whether it produces valid, runnable tests from a function signature; the code reviewer on whether it catches seeded bugs. When a pipeline regresses, component evals tell you which stage moved — far faster than staring at end-to-end output.
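As an illustration of a component-level eval, here is one way to score a code reviewer on seeded bugs. It assumes the reviewer is a callable that takes source text and returns a list of finding strings, and that each seeded case carries a tag expected to appear in a finding. Both the interface and the case format are hypothetical, not the actual contract in any repo.

```python
def seeded_bug_recall(reviewer, seeded_cases):
    """Return (catch rate, names of missed cases) for a reviewer agent.

    Each case is a dict with "name", "code" (source containing one planted
    bug), and "bug_tag" (substring expected in at least one finding).
    """
    if not seeded_cases:
        raise ValueError("no seeded cases to evaluate")
    missed = []
    for case in seeded_cases:
        findings = reviewer(case["code"])
        # A case counts as caught if any finding mentions the planted bug.
        if not any(case["bug_tag"] in finding for finding in findings):
            missed.append(case["name"])
    caught = len(seeded_cases) - len(missed)
    return caught / len(seeded_cases), missed
```

The list of missed case names is the useful part: it tells you which class of bug the reviewer stopped catching, not just that the rate moved.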
4. Wire It Into CI
An eval you have to remember to run is an eval you won't run. Put it in the pipeline: on every significant change, the suite runs against the fixed set, and if faithfulness drops below your threshold (I use 0.85 for grounded Q&A), the change doesn't merge. That single rule converts "we think it's better" into "the numbers say it's not worse" — which is the only honest way to ship changes to a probabilistic system.
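The merge rule can be sketched as a small gate script. The 0.85 threshold comes from the text above; averaging per-case scores into one number is an assumption here (a per-case floor or a percentile would be equally defensible), and the function names are hypothetical.

```python
from statistics import mean

FAITHFULNESS_THRESHOLD = 0.85  # threshold for grounded Q&A, per the text


def gate(scores, threshold=FAITHFULNESS_THRESHOLD):
    """Return (passed, average) for a list of per-case faithfulness scores.

    Assumption: the suite-level score is the plain mean over the fixed set.
    """
    if not scores:
        raise ValueError("no scores: an empty eval run must not pass the gate")
    avg = mean(scores)
    return avg >= threshold, avg


def main(scores):
    """Print the result and return a process exit code (0 = merge allowed)."""
    passed, avg = gate(scores)
    status = "PASS" if passed else "FAIL"
    print(f"{status}: faithfulness={avg:.3f} threshold={FAITHFULNESS_THRESHOLD}")
    return 0 if passed else 1
```

In a CI step, call `sys.exit(main(scores))` after the suite runs; a nonzero exit code fails the job, which is what blocks the merge. Treating an empty run as an error is deliberate, since a suite that silently scored nothing should never count as a pass.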
The Throughline
Evals complete a reliability triad alongside memory (stop repeating failures) and guardrails (stop drifting). Memory fixes the past, guardrails fix the present, evals protect the future. None of them demo well. All of them are why a system survives contact with real users, which is the argument I make in full in the boring infrastructure that actually ships.
What I Built
RAGAS-style scoring lives in rag-knowledge-engine; component-level evals across five topologies are in agentic-systems. For the retrieval side of the story, see RAG in production.